library(ggplot2)
library(caret)
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(FNN)
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(adabag)
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
library(jtools)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(magrittr)
library(neuralnet)
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
library(nnet)
data<-read.csv("diabetes.csv")
This work aims to predict whether a woman has diabetes based on several factors. The dataset comes from the National Institute of Diabetes and Digestive and Kidney Diseases, and all 768 observations were measured on women. Our analysis compares six methods for predicting diabetes. Each method requires a training set and a validation set: the training set is used to build the models, and the validation set tests them on unseen data, which tells us whether a model depends too heavily on the training data and fails to generalize. We then compare the results to see which model is most appropriate here. Note that for health data of this kind we would ideally want a near-perfect prediction rate: in real-world use, we especially want to avoid telling a woman she has diabetes when she does not.

First, let's inspect and visualize the data and deal with missing values. We have 768 observations and 9 variables. Looking at the first six observations, we already spot a problem with the SkinThickness variable: two zeros, which is physically impossible. Looking further, we see that during data collection zeros were entered instead of missing values in several columns. We will recode these zeros as missing values and then drop the affected rows, which leaves 392 observations. We then go through each variable in turn.
head(data)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
dim(data)
## [1] 768 9
str(data)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
data_df <- data %>%
mutate(BMI = replace(BMI, BMI == "0", NA)) %>%
mutate(Insulin = replace(Insulin, Insulin == "0", NA))%>%
mutate(Glucose=replace(Glucose, Glucose=="0",NA))%>%
mutate(SkinThickness=replace(SkinThickness, SkinThickness=="0", NA))
data_clean <- na.omit(data_df)
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
As its name suggests, this variable records the number of times a patient was pregnant. Based on the boxplot below, the middle half of our observations fall between 1 and 6 pregnancies. This was a little surprising at first, but the data concern Indian women in the 1980s, and these counts are representative of that population.
boxplot(data$Pregnancies)
abline(h=mean(data$Pregnancies), col="red")
What does the Glucose variable measure? It comes from an oral glucose tolerance test: a standard dose of glucose is ingested and blood glucose is measured two hours later. For a 2-hour test with a 75 g intake, a level below 140 mg/dL is normal, and the 3rd quartile shows that about 75% of our observations are at or below roughly this point. But some observations are zero, which suggests something is wrong: most likely a zero was entered whenever the value could not be measured, i.e. zero stands for a missing value. Once the zeros are removed, the boxplot looks far more plausible and is roughly symmetric, suggesting an approximately normal distribution.
boxplot(data$Glucose)
abline(h=mean(data$Glucose), col="red")
boxplot(data_clean$Glucose)
abline(h=mean(data_clean$Glucose), col="red")
As the name suggests, this is the women's blood pressure. The boxplot below shows a roughly symmetric distribution with a few suspected outliers. Once again some values are clearly wrong, for example a blood pressure of 0. After removing the zeros, the distribution looks approximately normal.
boxplot(data$BloodPressure)
abline(h=mean(data$BloodPressure), col="red")
boxplot(data_clean$BloodPressure)
abline(h=mean(data_clean$BloodPressure), col="red")
Skin thickness is the triceps skinfold thickness in mm. Here we have the same problem as before: a skin thickness of 0 is nonsense. After cleaning the data, the boxplot is more symmetric and no longer skewed.
boxplot(data$SkinThickness)
abline(h=mean(data$SkinThickness), col="red")
boxplot(data_clean$SkinThickness)
abline(h=mean(data_clean$SkinThickness), col="red")
The insulin measurement comes from a test where the patient fasts, then takes a dose of sugar (food or drink), and the body's reaction is observed. The results usually distinguish type 1 from type 2 diabetes: very high insulin suggests type 2, very low insulin suggests type 1. A value of zero for an insulin test is again suspicious; as before, these are really missing values. Once they are removed, the boxplot still shows some very high insulin levels, but these reflect reality, and I think it is worth keeping them because they could help predict diabetes in similar patients in the future.
boxplot(data$Insulin)
abline(h=mean(data$Insulin), col="red")
boxplot(data_clean$Insulin)
abline(h=mean(data_clean$Insulin), col="red")
BMI is the body mass index, computed as weight in kg / (height in m)^2, and here too some values equal zero. Apart from a few outliers, the data look approximately normally distributed. BMI is a useful predictor because it reflects the patient's diet and exercise habits, which are clearly linked to the probability of developing diabetes.
boxplot(data$BMI)
abline(h=mean(data$BMI), col="red")
boxplot(data_clean$BMI)
abline(h=mean(data_clean$BMI), col="red")
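The BMI formula quoted above is simple enough to sketch directly; the values here are made-up examples, not observations from the dataset:

```r
# BMI = weight (kg) / height (m)^2
bmi <- function(weight_kg, height_m) weight_kg / height_m^2

# Hypothetical example: 70 kg at 1.65 m
round(bmi(70, 1.65), 1)  # 25.7
```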
The diabetes pedigree function scores the likelihood of diabetes based on family history. A few values are very high, and they will help us predict diabetes for similar persons, for instance in the k-nearest-neighbour method.
boxplot(data$DiabetesPedigreeFunction)
abline(h=mean(data$DiabetesPedigreeFunction), col="red")
The name is self-explanatory. The middle 50% of the women observed are aged between 24 and 41.
boxplot(data$Age)
abline(h=mean(data$Age), col="red")
Finally, the outcome variable tells us whether a woman has diabetes; for some methods we need to recode it as a categorical variable. Below we see that 262 of our observations are healthy women and 130 have diabetes. This outcome variable lets us test whether our models perform well.
table(data_clean$Outcome)
##
## 0 1
## 262 130
This is how I replaced the zeros with NA and then excluded them from the data, so that I can conduct a sound analysis.
data_df <- data %>%
mutate(BMI = replace(BMI, BMI == "0", NA)) %>%
mutate(Insulin = replace(Insulin, Insulin == "0", NA))%>%
mutate(Glucose=replace(Glucose, Glucose=="0",NA))%>%
mutate(SkinThickness=replace(SkinThickness, SkinThickness=="0", NA))
data_clean <- na.omit(data_df)
Looking at the summary and the first rows of our observations, the data appear clean and ready for the application of our methods.
summary(data_clean)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 56.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.:21.00
## Median : 2.000 Median :119.0 Median : 70.00 Median :29.00
## Mean : 3.301 Mean :122.6 Mean : 70.66 Mean :29.15
## 3rd Qu.: 5.000 3rd Qu.:143.0 3rd Qu.: 78.00 3rd Qu.:37.00
## Max. :17.000 Max. :198.0 Max. :110.00 Max. :63.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.00 Min. :18.20 Min. :0.0850 Min. :21.00
## 1st Qu.: 76.75 1st Qu.:28.40 1st Qu.:0.2697 1st Qu.:23.00
## Median :125.50 Median :33.20 Median :0.4495 Median :27.00
## Mean :156.06 Mean :33.09 Mean :0.5230 Mean :30.86
## 3rd Qu.:190.00 3rd Qu.:37.10 3rd Qu.:0.6870 3rd Qu.:36.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3316
## 3rd Qu.:1.0000
## Max. :1.0000
head(data_clean)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 7 3 78 50 32 88 31.0
## 9 2 197 70 45 543 30.5
## 14 1 189 60 23 846 30.1
## 15 5 166 72 19 175 25.8
## DiabetesPedigreeFunction Age Outcome
## 4 0.167 21 0
## 5 2.288 33 1
## 7 0.248 26 1
## 9 0.158 53 1
## 14 0.398 59 1
## 15 0.587 51 1
Below I have plotted the histogram of each variable. They are not perfectly normally distributed, but the data look reasonable.
for( i in 1:8){ hist(data_clean[,i],
main = names(data_clean)[i],
xlab = names(data_clean)[i])
abline(v = mean(data_clean[,i]), col = "red")
abline(v = median(data_clean[,i]), col = "blue") }
Let's study the correlations first. We expect Age and Pregnancies to be correlated, since the older a woman gets, the more likely she is to have been pregnant more than once. The table below also shows a higher correlation I did not expect, between BMI and SkinThickness. We do not think it appropriate to drop any variable, as in our opinion none of the correlations is strong enough.
heatmap.2(cor(data_clean[,-9]), Rowv = FALSE, Colv = FALSE, dendrogram = "none",cellnote = round(cor(data_clean[,-9]),4), notecol = "black", key = FALSE, trace = 'none', margins = c(10,10))
Why and how does it work?
Logistic regression is a powerful classification tool. We first estimate, on the training set, each observation's probability of being diabetic (a propensity score). We then classify observations against a cutoff value, usually 0.5: if a woman's estimated probability is larger than 0.5, she is classified as 1 (has diabetes).
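The probability-then-cutoff step can be sketched in base R; the linear predictors below are made-up numbers, not output of the fitted model:

```r
# Logistic response: p = 1 / (1 + exp(-eta)), classified with a 0.5 cutoff
sigmoid <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical linear predictors for three women
eta        <- c(-2.1, 0.4, 1.8)
p          <- sigmoid(eta)
pred_class <- ifelse(p > 0.5, 1, 0)  # 1 = predicted diabetic
round(p, 3)  # 0.109 0.599 0.858
pred_class   # 0 1 1
```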
First we need to split the data into a training set and a validation set. The training set is used to create the model; the validation set tests the model on new data, new in the sense that it was not used to create the model.
set.seed(1)
train_index <- sample( dim(data_clean)[1], 0.6*dim(data_clean)[1])
train_df2 <- data_clean[ train_index, ]
valid_df2 <- data_clean[ -train_index, ]
We create the model from the training data and the 8 explanatory variables. It uses the logit link, which models the log-odds of the event: the model computes the probability that a woman, given her measurements, has diabetes.
set.seed(1)
logit.reg <- glm(Outcome ~ ., data = train_df2, family = "binomial")
summ(logit.reg, exp = T, digits = 3)
## MODEL INFO:
## Observations: 235
## Dependent Variable: Outcome
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(8) = 91.520, p = 0.000
## Pseudo-R² (Cragg-Uhler) = 0.452
## Pseudo-R² (McFadden) = 0.311
## AIC = 220.806, BIC = 251.943
##
## Standard errors: MLE
## ---------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------ ----------- ------- ------- -------- -------
## (Intercept) 0.000 0.000 0.000 -6.656 0.000
## Pregnancies 0.992 0.870 1.132 -0.112 0.911
## Glucose 1.038 1.023 1.053 5.048 0.000
## BloodPressure 1.018 0.984 1.053 1.006 0.315
## SkinThickness 1.013 0.968 1.060 0.562 0.574
## Insulin 0.999 0.995 1.003 -0.445 0.656
## BMI 1.084 1.005 1.169 2.095 0.036
## DiabetesPedigreeFunction 1.956 0.645 5.934 1.184 0.236
## Age 1.038 0.991 1.088 1.572 0.116
## ---------------------------------------------------------------------------
We see that almost every factor is statistically insignificant, but I believe these explanatory variables still explain the outcome well; there is a plausible causal link in reality.
If a record has an estimated probability above 50%, it is classified as having diabetes; below 50%, as a healthy woman.
logit.reg.pred.prob = predict(logit.reg, newdata= valid_df2, type = "response")
head(logit.reg.pred.prob)
## 5 7 9 14 15 17
## 0.74670742 0.02776546 0.87423451 0.78146612 0.64859853 0.55345529
pred1 <- ifelse(logit.reg.pred.prob>0.5,1,0)
tab<- table(Predicted= pred1, Actual= valid_df2$Outcome)
tab
## Actual
## Predicted 0 1
## 0 91 22
## 1 11 33
logistic_regression<-sum(diag(tab))/sum(tab)
logistic_regression
## [1] 0.7898089
So we have 79% correct predictions on new data. Logistic regression performs well, and I think it would perform even better with more data available.
Maybe we can do better with the following model; let's try.
Why this classification method, and how does it work?
A classification tree is a data-driven way of predicting outcomes that is easy to understand and intuitive to interpret. It splits the data into groups using a purity measure, which scores how good each candidate cut is. In our context, suppose the first split is on glucose: observations with glucose above some threshold go into a mostly-diabetic group, the rest into a mostly-healthy group. A single split almost never separates the classes perfectly, so the algorithm keeps splitting each group in two, and so on. If we grow a full tree, as we will do, the tree depends too much on the data used to grow it: the prediction rate on the training data will be 100%, but accuracy on new data will be lower.
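The purity measures that score candidate splits can be made concrete. This is a minimal base-R sketch of two standard impurity functions (rpart with split="information" uses the entropy-based one), not code taken from rpart itself:

```r
# Impurity of a node, given its vector of class proportions p
gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))

gini(c(1, 0))          # pure node: impurity 0
gini(c(0.5, 0.5))      # 50/50 node: maximally impure, 0.5
entropy(c(0.5, 0.5))   # 1 bit of impurity
```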
train_index <- sample( dim(data_clean)[1], 0.6*dim(data_clean)[1])
train_df <- data_clean[ train_index, ]
valid_df <- data_clean[ -train_index, ]
We transform the outcome into a categorical value.
train_df$Outcome = as.factor(train_df$Outcome)
valid_df$Outcome = as.factor(valid_df$Outcome)
t1 <- rpart(Outcome ~ .,
data=train_df,
method="class",
parms=list(split="information"),
control=rpart.control(minsplit=1,
minbucket=1,
maxdepth=30, # maxdepth max is 30
cp=0.0001,
usesurrogate=0,
maxsurrogate=0))
prp(t1)
Now that we have created the full tree, let's walk through an example to understand how it works. Imagine an observation: a woman with 1 pregnancy, glucose 80, BMI 18, age 27, skin thickness 20, blood pressure 45. I like to picture the tree as a game where a ball drops down from the root. At the first node the tree "asks" whether her glucose is at least 125; it is not, so the ball takes that branch. Each subsequent node asks another question, for example whether she has had more or fewer than 8 pregnancies (she had only one), until finally, with a BMI below 26, the ball lands in a leaf and the model predicts that this patient does not have diabetes.
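That decision path is just nested if/else logic. Here is a hand-coded sketch of a fragment of such a path, with illustrative thresholds taken from the walkthrough above; the real splits are whatever prp(t1) displays:

```r
# Hypothetical fragment of a decision path, not the fitted tree t1
predict_path <- function(glucose, pregnancies, bmi) {
  if (glucose >= 125)     return("diabetic")      # high-glucose branch
  if (pregnancies >= 8)   return("diabetic")
  if (bmi >= 26)          return("non-diabetic")  # illustrative leaf
  "non-diabetic"
}

# The woman from the example: glucose 80, 1 pregnancy, BMI 18
predict_path(glucose = 80, pregnancies = 1, bmi = 18)  # "non-diabetic"
```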
barplot(t1$variable.importance, horiz = T, las = 1)
Our model ranks glucose as the most important variable, which is expected given the disease, so the model seems to be on the right track. Now let's see how well it predicts the outcome on the training data.
prediction_train = predict(t1, train_df, type="class")
confusionMatrix(prediction_train, train_df$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 160 0
## 1 0 75
##
## Accuracy : 1
## 95% CI : (0.9844, 1)
## No Information Rate : 0.6809
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6809
## Detection Rate : 0.6809
## Detection Prevalence : 0.6809
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
Using the full tree, we make no errors at all. That might look amazing, but as discussed we are overfitting: the tree has memorized the training data, so we should expect much worse performance on the validation set.
prediction_valid = predict(t1, valid_df, type="class")
classification_tree_accuracy<-confusionMatrix(prediction_valid, valid_df$Outcome)
classification_tree_accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 28
## 1 16 27
##
## Accuracy : 0.7197
## 95% CI : (0.6426, 0.7884)
## No Information Rate : 0.6497
## P-Value [Acc > NIR] : 0.03779
##
## Kappa : 0.3517
##
## Mcnemar's Test P-Value : 0.09725
##
## Sensitivity : 0.8431
## Specificity : 0.4909
## Pos Pred Value : 0.7544
## Neg Pred Value : 0.6279
## Prevalence : 0.6497
## Detection Rate : 0.5478
## Detection Prevalence : 0.7261
## Balanced Accuracy : 0.6670
##
## 'Positive' Class : 0
##
As expected, performance is much worse on the validation data: the prediction rate dropped from 100% to only 72% (a 28-point drop). We need a smaller tree; we must accept more errors on the training set in order to perform better on the validation set. How do we avoid overfitting? Simply put: grow a full tree, then prune it back step by step, each time recording the cross-validated error, and finally keep the tree size that makes the fewest errors.
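The "keep the size with the fewest errors" rule is just a which.min over the xerror column of the cp table, exactly what prune() is told to do below. A sketch on a toy table with values mirroring the printcp() output:

```r
# Toy cp table: pick the CP whose cross-validated error (xerror) is smallest
cptable <- cbind(CP     = c(0.2267, 0.1600, 0.0444, 0.0400, 0.0267, 0.0200),
                 xerror = c(1.0000, 0.9867, 0.8267, 0.7733, 0.6933, 0.7733))
best_cp <- cptable[which.min(cptable[, "xerror"]), "CP"]
best_cp  # 0.0267: row 5 has the minimum xerror
```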
cv.t1 <- rpart(Outcome~., data = train_df, method = "class", cp=0.00001, minsplit=5, xval=5)
printcp(cv.t1)
##
## Classification tree:
## rpart(formula = Outcome ~ ., data = train_df, method = "class",
## cp = 1e-05, minsplit = 5, xval = 5)
##
## Variables actually used in tree construction:
## [1] Age BloodPressure
## [3] BMI DiabetesPedigreeFunction
## [5] Glucose Insulin
## [7] Pregnancies SkinThickness
##
## Root node error: 75/235 = 0.31915
##
## n= 235
##
## CP nsplit rel error xerror xstd
## 1 0.2266667 0 1.00000 1.00000 0.095279
## 2 0.1600000 1 0.77333 0.98667 0.094937
## 3 0.0444444 2 0.61333 0.82667 0.090079
## 4 0.0400000 5 0.48000 0.77333 0.088126
## 5 0.0266667 6 0.44000 0.69333 0.084846
## 6 0.0200000 8 0.38667 0.77333 0.088126
## 7 0.0133333 15 0.24000 0.82667 0.090079
## 8 0.0088889 19 0.18667 0.82667 0.090079
## 9 0.0000100 22 0.16000 0.82667 0.090079
pruned.ct <- prune(cv.t1, cp=cv.t1$cptable[which.min(cv.t1$cptable[,"xerror"]), "CP"])
prp(pruned.ct, type=1, extra =1, split.font = 1, varlen =-10)
Let’s see the performance on training data:
prediction_pruned_train = predict(pruned.ct, train_df, type="class")
confusionMatrix(prediction_pruned_train, train_df$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 147 20
## 1 13 55
##
## Accuracy : 0.8596
## 95% CI : (0.8085, 0.9013)
## No Information Rate : 0.6809
## P-Value [Acc > NIR] : 2.353e-10
##
## Kappa : 0.6687
##
## Mcnemar's Test P-Value : 0.2963
##
## Sensitivity : 0.9187
## Specificity : 0.7333
## Pos Pred Value : 0.8802
## Neg Pred Value : 0.8088
## Prevalence : 0.6809
## Detection Rate : 0.6255
## Detection Prevalence : 0.7106
## Balanced Accuracy : 0.8260
##
## 'Positive' Class : 0
##
We see that training performance dropped a bit, by about 14%, as expected, because we are deliberately allowing and searching for this error. Now we have to check whether it really improves performance on new data!
prediction_pruned_valid = predict(pruned.ct, valid_df, type="class")
pruned_tree_accuracy<-confusionMatrix(prediction_pruned_valid, valid_df$Outcome)
pruned_tree_accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 23
## 1 14 32
##
## Accuracy : 0.7643
## 95% CI : (0.6901, 0.8283)
## No Information Rate : 0.6497
## P-Value [Acc > NIR] : 0.001308
##
## Kappa : 0.462
##
## Mcnemar's Test P-Value : 0.188445
##
## Sensitivity : 0.8627
## Specificity : 0.5818
## Pos Pred Value : 0.7928
## Neg Pred Value : 0.6957
## Prevalence : 0.6497
## Detection Rate : 0.5605
## Detection Prevalence : 0.7070
## Balanced Accuracy : 0.7223
##
## 'Positive' Class : 0
##
pruned_prob = predict(pruned.ct, valid_df, type="prob")
We have improved our accuracy on new data from 72% to about 76%, a gain of roughly 4-5 percentage points, but this model is still weaker than the logistic regression.
Now we approach another tree-based method, the random forest, which should normally improve our ability to predict the outcome.
How does it work?
As the name implies, this method builds many single trees, starting from the idea that multiple trees can achieve better predictions than one. Where is the randomness? The algorithm 1) draws a random bootstrap sample from the training data for each tree and 2) randomly selects a subset of the features for each split. This randomness makes the errors of the trees largely uncorrelated, so after growing 500 or more trees the final decision is the majority vote (or mean prediction) across trees.
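The two sources of randomness and the vote can be sketched in base R; this toy illustration is not randomForest's internals:

```r
set.seed(1)
n       <- 10
outcome <- c(0, 1, 0, 0, 1, 0, 1, 0, 0, 1)

# 1) each tree sees a bootstrap sample of the rows (with replacement)
boot_rows <- sample(n, n, replace = TRUE)

# 2) the forest's decision is the majority vote across trees;
#    here three hypothetical trees vote on one observation
tree_votes <- c(1, 0, 1)
majority   <- as.integer(mean(tree_votes) >= 0.5)
majority   # 1: two of three trees vote "diabetic"
```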
# Note: Outcome is numeric here, so tuneRF tunes a regression forest
# (hence the warnings below); converting Outcome to a factor first
# would tune for classification instead.
tune_rf <- tuneRF(data_clean[,-9], data_clean$Outcome, doBest = TRUE, stepFactor = 2)
## Warning in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, :
## The response has five or fewer unique values. Are you sure you want to do
## regression?
## mtry = 2 OOB error = 0.1501364
## Searching left ...
## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, :
## The response has five or fewer unique values. Are you sure you want to do
## regression?
## mtry = 1 OOB error = 0.1630896
## -0.08627601 0.05
## Searching right ...
## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, :
## The response has five or fewer unique values. Are you sure you want to do
## regression?
## mtry = 4 OOB error = 0.1507407
## -0.00402458 0.05
## Warning in randomForest.default(x, y, mtry = res[which.min(res[, 2]), 1], :
## The response has five or fewer unique values. Are you sure you want to do
## regression?
rf <- randomForest(Outcome ~ ., data = train_df, ntree = 1000, mtry = tune_rf$mtry, nodesize = 10, replace = TRUE)
rf.pred <- predict(rf,valid_df)
random_forest_accuracy<-confusionMatrix(rf.pred, valid_df$Outcome)
random_forest_accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 95 28
## 1 7 27
##
## Accuracy : 0.7771
## 95% CI : (0.7038, 0.8395)
## No Information Rate : 0.6497
## P-Value [Acc > NIR] : 0.0003761
##
## Kappa : 0.463
##
## Mcnemar's Test P-Value : 0.0007232
##
## Sensitivity : 0.9314
## Specificity : 0.4909
## Pos Pred Value : 0.7724
## Neg Pred Value : 0.7941
## Prevalence : 0.6497
## Detection Rate : 0.6051
## Detection Prevalence : 0.7834
## Balanced Accuracy : 0.7111
##
## 'Positive' Class : 0
##
We achieve an accuracy of almost 78%, which is slightly better than before. Note that with more training data, say 500 or more observations in the training set, it would likely perform much better.
How does it work?
The underlying idea behind boosting is to build weak learners sequentially, each new learner concentrating on the observations the previous ones got wrong, and then combine them into one strong predictor. Here we use gradient-boosted trees via the xgbTree method in caret.
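The reweighting idea can be sketched with the classic AdaBoost update (xgbTree is gradient boosting, so this shows the general boosting principle rather than xgboost's exact update):

```r
# AdaBoost-style reweighting: points the current weak learner gets
# wrong receive more weight in the next round
w       <- rep(1/5, 5)                       # uniform starting weights
correct <- c(TRUE, TRUE, FALSE, TRUE, FALSE) # this round's hits/misses
err     <- sum(w[!correct])                  # weighted error = 0.4
alpha   <- 0.5 * log((1 - err) / err)        # this learner's weight
w       <- w * exp(ifelse(correct, -alpha, alpha))
w       <- w / sum(w)                        # renormalize
round(w, 3)  # misclassified points now carry half the total weight
```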
set.seed(1)
model <- train(
Outcome ~., data = train_df
, method = "xgbTree",
trControl = trainControl("cv", number = 15)
)
model$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 4 50 1 0.3 0 0.6 1 0.75
predicted.classes <- model %>% predict(valid_df)
boost_accuracy<-mean(predicted.classes == valid_df$Outcome)
boost_accuracy
## [1] 0.7834395
So far this is the second-highest accuracy: we reach 78% correct predictions on new data.
Next we try bagging (bootstrap aggregating): we fit many trees, each on a bootstrap sample of the training data, and combine their predictions by majority vote.
bag<-bagging(Outcome~., data=train_df, nbagg=100, coob=TRUE, control = rpart.control(minsplit = 2, cp=0))
bag.pred<-predict(bag, valid_df, type="class")
bag_accu<-confusionMatrix(factor(bag.pred$class, levels=c("0", "1")), valid_df$Outcome)
bag_accu
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 95 29
## 1 7 26
##
## Accuracy : 0.7707
## 95% CI : (0.697, 0.8339)
## No Information Rate : 0.6497
## P-Value [Acc > NIR] : 0.0007134
##
## Kappa : 0.4451
##
## Mcnemar's Test P-Value : 0.0004653
##
## Sensitivity : 0.9314
## Specificity : 0.4727
## Pos Pred Value : 0.7661
## Neg Pred Value : 0.7879
## Prevalence : 0.6497
## Detection Rate : 0.6051
## Detection Prevalence : 0.7898
## Balanced Accuracy : 0.7020
##
## 'Positive' Class : 0
##
Our accuracy reaches 77%, which is lower than some other methods, but it is easier to compare the results side by side in a single table.
results=rbind(cbind(classification_tree_accuracy$overall["Accuracy"],pruned_tree_accuracy$overall["Accuracy"], boost_accuracy, random_forest_accuracy$overall["Accuracy"], bag_accu$overall["Accuracy"]))
colnames(results)=cbind("Classification Tree", "Pruned Tree", "Boosted Tree", "Random Forest", "Bagging")
results
## Classification Tree Pruned Tree Boosted Tree Random Forest
## Accuracy 0.7197452 0.7643312 0.7834395 0.7770701
## Bagging
## Accuracy 0.7707006
knitr::kable(results)
| Classification Tree | Pruned Tree | Boosted Tree | Random Forest | Bagging | |
|---|---|---|---|---|---|
| Accuracy | 0.7197452 | 0.7643312 | 0.7834395 | 0.7770701 | 0.7707006 |
We see how the methods improve from the full classification tree to bagging. We start with a low accuracy for the full tree, gain about 5 points with pruning, reach the highest accuracy of this group with boosting, and drop back by 1-2 points for the last two methods. At this point we would still prefer logistic regression, but boosting also performs well.
Now I will try a new method called k-nearest neighbour. It is an intuitive method that starts from the principle that if I have more or less the same attributes as somebody else, I will probably end up with the same outcome. It is like a cheesecake recipe: if we do the same thing, we get the same result.
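The principle can be sketched by hand on toy points (not from the dataset): Euclidean distance on the features, then a majority vote among the k closest training points.

```r
# k-nearest-neighbour by hand on made-up 2-D points
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6))
train_y <- c(0, 0, 0, 1, 1)        # labels of the training points
new_x   <- c(0.5, 0.5)             # the observation to classify
k       <- 3

d       <- sqrt(colSums((t(train_x) - new_x)^2))  # distance to each point
nearest <- order(d)[1:k]                          # indices of the k closest
vote    <- as.integer(mean(train_y[nearest]) >= 0.5)
vote  # 0: the new point sits among the class-0 cluster
```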
First we need to normalize the data and split them:
set.seed(1)
train_index <- sample( dim(data_clean)[1], 0.6*dim(data_clean)[1])
train_df <- data_clean[ train_index, ]
valid_df <- data_clean[ -train_index, ]
Normalize the data:
set.seed(1)
train_norm_df <- train_df
valid_norm_df <- valid_df
norm_values <- preProcess( train_df[, c(1,2,3,4,5,6,7,8)], method=c("center", "scale"))
train_norm_df[, c(1,2,3,4,5,6,7,8)] <- predict(norm_values, train_df[, c(1,2,3,4,5,6,7,8)])
valid_norm_df[, c(1,2,3,4,5,6,7,8)] <- predict(norm_values, valid_df[, c(1,2,3,4,5,6,7,8)])
new_data_norm = predict(norm_values, data_clean)
train_norm_df$Outcome = as.factor(train_norm_df$Outcome)
valid_norm_df$Outcome = as.factor(valid_norm_df$Outcome)
set.seed(1)
ctrl = trainControl(method = "cv",
number = 10,
)
knnfit1 = train(Outcome ~ .,
data = train_norm_df,
method = "knn",
trControl = ctrl,
tuneGrid = expand.grid(k = seq(13)),
)
plot(knnfit1)
knnfit1$bestTune$k
## [1] 10
test_pred = predict(knnfit1, newdata = valid_norm_df[,-c(9)])
knn_accuracy<-confusionMatrix(test_pred, reference = as.factor(valid_norm_df$Outcome),)
knn_accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 90 29
## 1 12 26
##
## Accuracy : 0.7389
## 95% CI : (0.6628, 0.8056)
## No Information Rate : 0.6497
## P-Value [Acc > NIR] : 0.01072
##
## Kappa : 0.3823
##
## Mcnemar's Test P-Value : 0.01246
##
## Sensitivity : 0.8824
## Specificity : 0.4727
## Pos Pred Value : 0.7563
## Neg Pred Value : 0.6842
## Prevalence : 0.6497
## Detection Rate : 0.5732
## Detection Prevalence : 0.7580
## Balanced Accuracy : 0.6775
##
## 'Positive' Class : 0
##
We obtain an accuracy of only 74% in this case, which is lower than the other methods.
Finally, I will use a neural network.
set.seed(1)
train_norm <- train_df2
valid_norm <- valid_df2
norm <- preProcess( train_df2[, c(1,2,3,4,5,6,7,8)], method=c("center", "scale"))
train_norm[, c(1,2,3,4,5,6,7,8)] <- predict(norm, train_df2[, c(1,2,3,4,5,6,7,8)])
valid_norm[, c(1,2,3,4,5,6,7,8)] <- predict(norm, valid_df2[, c(1,2,3,4,5,6,7,8)])
new_data = predict(norm, data_clean)
set.seed(1)
nn<- neuralnet(Outcome~., data=train_norm, hidden = c(4,4) )
plot(nn)
set.seed(1)
nn.results <- compute(nn, valid_norm[,-9])
pred_nn <- nn.results$net.result
pred_nn_final <- ifelse(pred_nn >=0.5 ,1,0)
tab_nn <- table(pred_nn_final, valid_norm$Outcome)
tab_nn
##
## pred_nn_final 0 1
## 0 85 23
## 1 17 32
NeuralNet<-sum(diag(tab_nn))/sum(tab_nn)
NeuralNet
## [1] 0.7452229
I expected a higher accuracy from the neural network, but it did not surpass methods such as logistic regression or boosting. Finally, let's compare all the results:
results=rbind(cbind(NeuralNet,logistic_regression,knn_accuracy$overall["Accuracy"],pruned_tree_accuracy$overall["Accuracy"], classification_tree_accuracy$overall["Accuracy"], boost_accuracy, random_forest_accuracy$overall["Accuracy"],bag_accu$overall["Accuracy"]))
colnames(results)=cbind("Neural Net","Logistic Regression","K-nearest neighbour", "Pruned Tree", "Classification Tree","Boosting", "Random Forest", "Bagging")
results
## Neural Net Logistic Regression K-nearest neighbour Pruned Tree
## Accuracy 0.7452229 0.7898089 0.7388535 0.7643312
## Classification Tree Boosting Random Forest Bagging
## Accuracy 0.7197452 0.7834395 0.7770701 0.7707006
knitr::kable(results)
| Neural Net | Logistic Regression | K-nearest neighbour | Pruned Tree | Classification Tree | Boosting | Random Forest | Bagging | |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.7452229 | 0.7898089 | 0.7388535 | 0.7643312 | 0.7197452 | 0.7834395 | 0.7770701 | 0.7707006 |
To conclude, our analysis suffered from the loss of observations, yet the models worked quite well. Looking closely at the results, the accuracies all tend toward similar values; given that every accuracy estimate comes with a confidence interval, the intervals almost overlap. Since diabetes is a disease, prediction errors are costly. With no better model available I would use logistic regression, which has the best accuracy on new data (79%). Moreover, I think that with more data available we could develop more accurate models and thus reach a higher accuracy on new data.

As an aside, I conducted a second analysis, starting from the idea that instead of replacing the zeros in the dataset by NA, I could replace them by the mean of the variable (when it is roughly normally distributed) or by the median otherwise. This way I have more data to build the models. I thought the models would improve, but they did not: imputing the mean or median distorts the data. The quick analysis is included below, though each of the three methods used there is less accurate than before. Comparing the results on imputed data with those on cleaned data, performance drops by only about 1-2%. I expected a more drastic drop, which is why I still strongly believe that with more data available I could build better models.
Annex: second analysis
data_median <- data %>%
  mutate(BMI = replace(BMI, BMI == 0, mean(data_clean$BMI))) %>%
  mutate(Insulin = replace(Insulin, Insulin == 0, median(data_clean$Insulin))) %>%
  mutate(Glucose = replace(Glucose, Glucose == 0, mean(data_clean$Glucose))) %>%
  mutate(SkinThickness = replace(SkinThickness, SkinThickness == 0, mean(data_clean$SkinThickness)))
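The mean-or-median rule behind the pipeline above can be packaged as a helper that decides per column with a Shapiro-Wilk normality test. `impute_zeros` is a hypothetical name, and unlike the pipeline above (which takes the statistics from the cleaned data) this sketch centres on the column's own non-zero values:

```r
# replace zeros with the mean of the non-zero values when they look normal
# (Shapiro-Wilk does not reject at level alpha), with the median otherwise
impute_zeros <- function(x, alpha = 0.05) {
  nonzero <- x[x != 0]
  centre <- if (shapiro.test(nonzero)$p.value > alpha) mean(nonzero) else median(nonzero)
  replace(x, x == 0, centre)
}
```

Applied to this dataset it would read, for example, `data %>% mutate(BMI = impute_zeros(BMI))`.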
set.seed(1)
train_index_m <- sample(nrow(data_median), 0.6 * nrow(data_median))
train_m <- data_median[train_index_m, ]
valid_m <- data_median[-train_index_m, ]
Random Forest 2:
train_m$Outcome <- as.factor(train_m$Outcome)
valid_m$Outcome <- as.factor(valid_m$Outcome)
# tuneRF needs the factor response as well: the original call passed the
# numeric Outcome column from data_median, so randomForest fell back to
# regression mode and warned that the response has five or fewer unique
# values at every mtry it tried
tune_rfm <- tuneRF(train_m[, -9], train_m$Outcome, doBest = TRUE, stepFactor = 2)
rfm <- randomForest(Outcome~., data = train_m, ntree=500, mtry=tune_rfm$mtry, nodesize=40, importance= TRUE)
varImpPlot(rfm,type = 1)
rf.pred.m <- predict(rfm,valid_m)
rf3<-confusionMatrix(rf.pred.m, valid_m$Outcome)
rf3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 189 50
## 1 20 49
##
## Accuracy : 0.7727
## 95% CI : (0.7218, 0.8183)
## No Information Rate : 0.6786
## P-Value [Acc > NIR] : 0.0001743
##
## Kappa : 0.4339
##
## Mcnemar's Test P-Value : 0.0005279
##
## Sensitivity : 0.9043
## Specificity : 0.4949
## Pos Pred Value : 0.7908
## Neg Pred Value : 0.7101
## Prevalence : 0.6786
## Detection Rate : 0.6136
## Detection Prevalence : 0.7760
## Balanced Accuracy : 0.6996
##
## 'Positive' Class : 0
##
Logistic Regression 2:
logit.reg.m <- glm(Outcome ~ ., data = train_m, family = "binomial")
summ(logit.reg.m, exp = T, digits = 3)
## MODEL INFO:
## Observations: 460
## Dependent Variable: Outcome
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(8) = 169.037, p = 0.000
## Pseudo-R² (Cragg-Uhler) = 0.420
## Pseudo-R² (McFadden) = 0.279
## AIC = 453.912, BIC = 491.093
##
## Standard errors: MLE
## ---------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------ ----------- ------- ------- -------- -------
## (Intercept) 0.000 0.000 0.002 -8.538 0.000
## Pregnancies 1.171 1.083 1.267 3.936 0.000
## Glucose 1.038 1.029 1.048 7.723 0.000
## BloodPressure 0.980 0.967 0.993 -3.000 0.003
## SkinThickness 0.994 0.962 1.026 -0.383 0.702
## Insulin 0.997 0.994 1.000 -1.952 0.051
## BMI 1.101 1.054 1.150 4.334 0.000
## DiabetesPedigreeFunction 2.974 1.394 6.346 2.818 0.005
## Age 1.016 0.994 1.040 1.406 0.160
## ---------------------------------------------------------------------------
logit.reg.pred.prob.m <- predict(logit.reg.m, newdata = valid_m, type = "response")
head(logit.reg.pred.prob.m)
## 6 9 10 11 12 13
## 0.1269180 0.5320783 0.3986955 0.1778740 0.9202342 0.8241585
pred1.m <- ifelse(logit.reg.pred.prob.m > 0.5, 1, 0)
tab.m <- table(Predicted = pred1.m, Actual = valid_m$Outcome)
logistic_regression.m <- sum(diag(tab.m)) / sum(tab.m)
logistic_regression.m
## [1] 0.775974
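The 0.5 cutoff in `pred1.m` is itself a modelling choice. Since the costly error with a disease is a missed positive, lowering the threshold raises sensitivity at the price of specificity. A self-contained sketch on simulated probabilities (the 0.3 threshold is illustrative, not tuned on this data):

```r
set.seed(1)
p_hat <- runif(200)         # stand-in for fitted class-1 probabilities
y <- rbinom(200, 1, p_hat)  # stand-in for the true outcomes
sens <- function(cut) {     # share of true positives that get caught
  pred <- as.integer(p_hat > cut)
  sum(pred == 1 & y == 1) / sum(y == 1)
}
c(at_0.5 = sens(0.5), at_0.3 = sens(0.3))  # sensitivity can only increase
```

The same one-line change could be applied to `pred1.m` above if catching diabetics mattered more than raw accuracy.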
Boosting 2:
set.seed(1)
model2 <- train(
  Outcome ~ ., data = train_m,
  method = "xgbTree",
  trControl = trainControl("cv", number = 15)
)
model2$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 10 50 1 0.3 0 0.8 1 0.5
predicted.classes.2 <- model2 %>% predict(valid_m)
boost_accuracy2<-mean(predicted.classes.2 == valid_m$Outcome)
boost_accuracy2
## [1] 0.7694805
results2 <- rbind(cbind(logistic_regression.m, boost_accuracy2, rf3$overall["Accuracy"]))
colnames(results2) <- c("Logistic 2", "Boosting 2", "Random Forest 2")
results2
## Logistic 2 Boosting 2 Random Forest 2
## Accuracy 0.775974 0.7694805 0.7727273
knitr::kable(results2)
|          | Logistic 2 | Boosting 2 | Random Forest 2 |
|---|---|---|---|
| Accuracy | 0.775974 | 0.7694805 | 0.7727273 |
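The "~1-2%" drop quoted in the conclusion can be read off by differencing the two tables. Copying the accuracies of the three models fitted on both versions of the data:

```r
# accuracies copied from the cleaned-data table and the imputed-data table
clean   <- c(Logistic = 0.7898089, Boosting = 0.7834395, RandomForest = 0.7770701)
imputed <- c(Logistic = 0.7759740, Boosting = 0.7694805, RandomForest = 0.7727273)
round(clean - imputed, 4)
##     Logistic     Boosting RandomForest
##       0.0138       0.0140       0.0043
```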
knitr::kable(results)
|          | Neural Net | Logistic Regression | K-nearest neighbour | Pruned Tree | Classification Tree | Boosting | Random Forest | Bagging |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.7452229 | 0.7898089 | 0.7388535 | 0.7643312 | 0.7197452 | 0.7834395 | 0.7770701 | 0.7707006 |